perf: optimize Hunyuan DiT Ulysses and non-attention paths #1200

Open
starrkk wants to merge 6 commits into ModelTC:main from starrkk:codex/hunyuan-dit-ulysses-optimizations

starrkk commented Jun 30, 2026

Summary

  • add optional Hunyuan DiT non-attention torch.compile wrappers controlled by model config
  • add split QKV input/output support for Ulysses attention through explicit config/API parameters
  • add optional async text gather and bounded text gather buffer reuse without inference-operator profiling hooks
  • wire shared QKV activation quantization into the Hunyuan transformer path through model config
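
The "bounded text gather buffer reuse" item can be pictured as a small least-recently-used cache. The following is a minimal sketch under stated assumptions, not the code in this PR: the class name `BoundedBufferCache`, the `max_entries` parameter, and the key layout are hypothetical, and a `bytearray` stands in for a real gather buffer.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class BoundedBufferCache:
    """LRU-style cache that keeps at most `max_entries` reusable buffers."""

    def __init__(self, max_entries: int = 4):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._buffers: "OrderedDict[Hashable, object]" = OrderedDict()

    def get(self, key: Hashable, allocate: Callable[[], object]) -> object:
        """Return the cached buffer for `key`, allocating it on a miss."""
        if key in self._buffers:
            self._buffers.move_to_end(key)  # mark as most recently used
            return self._buffers[key]
        buf = allocate()
        self._buffers[key] = buf
        if len(self._buffers) > self.max_entries:
            self._buffers.popitem(last=False)  # evict the least recently used
        return buf

    def __len__(self) -> int:
        return len(self._buffers)


# Hypothetical key: (text length, dtype name). Same key, same buffer.
cache = BoundedBufferCache(max_entries=2)
a = cache.get((256, "bf16"), lambda: bytearray(256))
b = cache.get((256, "bf16"), lambda: bytearray(256))
```

Capping the number of entries keeps memory flat when prompt lengths vary from request to request, at the cost of an occasional reallocation after an eviction.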

Why

These changes reduce Python/tensor layout overhead around HunyuanVideo DiT inference and let Hygon DCU deployments reuse activation quantization for consecutive Q/K/V projections. Runtime choices are now passed through config/API parameters instead of environment-variable switches.
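
To illustrate what reusing activation quantization across consecutive Q/K/V projections means, here is a pure-Python sketch. It assumes symmetric per-token int8 quantization and a single per-tensor weight scale; the function names and the `calls` counter are hypothetical, and the real kernels in this PR may use a different scheme.

```python
from typing import List, Tuple

Matrix = List[List[float]]
calls = {"quantize": 0}  # counts how often activations are quantized


def quantize_per_token(x: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric int8 quantization with one dynamic scale per row (token)."""
    calls["quantize"] += 1
    q_rows, scales = [], []
    for row in x:
        absmax = max(abs(v) for v in row)
        scale = absmax / 127.0 if absmax > 0 else 1.0
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def int8_linear(xq, x_scales, wq, w_scale: float) -> Matrix:
    """Integer dot products, then rescale by activation and weight scales."""
    return [
        [sum(a * b for a, b in zip(row, w_row)) * s * w_scale for w_row in wq]
        for row, s in zip(xq, x_scales)
    ]


x = [[1.0, -2.0], [0.5, 0.25]]
identity_q = [[127, 0], [0, 127]]  # with scale 1/127 this acts as identity

# Quantize the shared input once, then feed all three projections.
xq, x_scales = quantize_per_token(x)
q_out = int8_linear(xq, x_scales, identity_q, 1.0 / 127.0)
k_out = int8_linear(xq, x_scales, identity_q, 1.0 / 127.0)
v_out = int8_linear(xq, x_scales, identity_q, 1.0 / 127.0)
```

The saving comes from calling the quantization step once instead of three times; this is only valid when all three projections read the same input tensor and expect the same activation format.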

Validation

  • branch rebuilt on latest ModelTC/LightX2V:main (89dfa833)
  • ruff check --config=pyproject.toml passed for the touched files
  • ruff format --check --config=pyproject.toml passed for the touched files
  • python -m py_compile passed for the touched files
  • source check confirmed no LIGHTX2V_* env switches or profiler ranges remain in the touched operator files
  • validated as part of the HunyuanVideo1.5 I2V 8-card benchmark path on Hygon DCU

zhenggf added 3 commits June 30, 2026 11:50
(cherry picked from commit 8f06fb6c7e0859f432a329a84f8d5d8e3a386ad1)
Support split image/text QKV inputs, optional split attention outputs, async text all_gather, and profiler ranges for Ulysses sequence-parallel attention.

(cherry picked from commit 8bb7c3e1784140a8f6d372fe429b468e3a502b8b)
Reuse dynamic activation quantization across consecutive Q/K/V projections and route split image/text tensors through the Ulysses attention path when enabled.

(cherry picked from commit 61c5df5c20106254d5294b910cdf3d1780970a97)

gemini-code-assist Bot left a comment

Code Review

This pull request introduces several performance optimizations for Ulysses attention and Hunyuan Video transformer inference, including support for split QKV inputs/outputs to avoid copy overhead, asynchronous text gathering, buffer reuse, shared dynamic activation quantization, and optional torch.compile support for non-attention branches. The reviewer identified several critical issues: potential NCCL hangs due to overlapping collective operations on the same process group, high compilation overhead and cache thrashing from compiling functions with custom weight objects, an unbounded memory leak in the text gather buffer cache under dynamic prompt lengths, incorrect text mask length calculation when using split QKV inputs, and a potential AttributeError in the shared quantization check if key/value weights lack the expected quantization methods.
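
One way to avoid the `AttributeError` mentioned in the review is to check every weight object before taking the shared path. This is a hedged sketch only: the helper `can_share_act_quant` and the method name `act_quant_func` are assumptions for illustration, not names confirmed by this PR.

```python
def can_share_act_quant(weights, method_name="act_quant_func"):
    """True only if every weight object exposes a callable `method_name`."""
    weights = list(weights)
    # An empty list must not count as "shareable", so check it explicitly.
    return bool(weights) and all(
        callable(getattr(w, method_name, None)) for w in weights
    )


class QuantizedWeight:
    """Stand-in for a weight wrapper that can quantize activations."""

    def act_quant_func(self, x):
        return x, 1.0


class PlainWeight:
    """Stand-in for an unquantized weight without the method."""


q, k, v = QuantizedWeight(), QuantizedWeight(), PlainWeight()
shared_ok = can_share_act_quant([q, k, QuantizedWeight()])
mixed_ok = can_share_act_quant([q, k, v])
```

Using `getattr` with a default makes the check return `False` for a mixed set of weights instead of raising, so the caller can fall back to the per-projection path.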

Comment thread lightx2v/common/ops/attn/ulysses_attn.py Outdated
Comment thread lightx2v/models/networks/hunyuan_video/infer/transformer_infer.py
Comment thread lightx2v/common/ops/attn/ulysses_attn.py Outdated
Comment on lines +128 to +131
if split_qkv_input:
    img_qkv_len = img_q.shape[0]
    txt_qkv_len = txt_q.shape[0]
    txt_mask_len = None

medium

When split_qkv_input is True, txt_mask_len is unconditionally set to None. However, if len(cu_seqlens_qkv) == 3, there is a text mask, and txt_mask_len should be computed as cu_seqlens_qkv[2] - slice_qkv_len to ensure correctness for other models or configurations using this generic Ulysses attention implementation.

Suggested change
  if split_qkv_input:
      img_qkv_len = img_q.shape[0]
      txt_qkv_len = txt_q.shape[0]
-     txt_mask_len = None
+     txt_mask_len = cu_seqlens_qkv[2] - slice_qkv_len if len(cu_seqlens_qkv) == 3 else None
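
The suggested expression can be checked in isolation with plain integers. The sketch below assumes that `cu_seqlens_qkv` holds cumulative sequence lengths and that `slice_qkv_len` is the per-rank image slice length; the helper name `text_mask_len` and the sample numbers are made up for illustration.

```python
from typing import Optional, Sequence


def text_mask_len(cu_seqlens_qkv: Sequence[int], slice_qkv_len: int) -> Optional[int]:
    """Mirror the suggested change: a third entry means a text mask exists."""
    if len(cu_seqlens_qkv) == 3:
        return cu_seqlens_qkv[2] - slice_qkv_len
    return None


# Hypothetical values: 64 image tokens on this rank, text padded to 32.
with_mask = text_mask_len([0, 80, 96], slice_qkv_len=64)
without_mask = text_mask_len([0, 80], slice_qkv_len=64)
```

With three entries the helper returns 32; with two entries it returns `None`, which matches the original split-input behavior.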

Comment thread lightx2v/models/networks/hunyuan_video/infer/transformer_infer.py
@starrkk starrkk marked this pull request as ready for review June 30, 2026 09:25
@helloyongyang

Please do not include profiling code in inference operators.

Please do not use environment variables as switches.

starrkk commented Jul 1, 2026

Addressed in 59ddfb8.

  • Removed profiler ranges from the Ulysses attention/all-to-all inference operator paths.
  • Removed LIGHTX2V_* environment-variable switches from the touched operator files.
  • Runtime choices are now passed through model config / explicit API parameters instead of process environment variables.
  • Updated the PR description to reflect the new config/API-based control path.
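
As a rough picture of passing runtime choices through config rather than environment variables, the sketch below reads options from a plain mapping. Only `split_qkv_input` appears in the reviewed code; the class `UlyssesOptions` and the other field names are hypothetical placeholders, not the keys this PR defines.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class UlyssesOptions:
    """Explicit, immutable switches instead of process-wide env variables."""

    split_qkv_input: bool = False
    split_attn_output: bool = False  # hypothetical name
    async_text_gather: bool = False  # hypothetical name
    text_gather_buffer_limit: int = 4  # hypothetical name

    @classmethod
    def from_config(cls, config: Mapping[str, Any]) -> "UlyssesOptions":
        """Pick known keys from a model config and ignore everything else."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in config.items() if k in known})


model_config = {"split_qkv_input": True, "unrelated_key": 123}
opts = UlyssesOptions.from_config(model_config)
```

Because the options travel with the model config, two models in the same process can run with different settings, which a global environment variable cannot express.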
